Thank you.
Yeah, well, it's quite crowded, also considering the online participants: more than I anticipated.
Welcome.
Yeah, I want to talk about the importance of workflow engines in contemporary data analysis,
and I amended the title with "for admins" because I anticipated that some of my fellow administrators
and service providers would be with us, along with HPC users and data analysts.
I'm from NHR South-West in Rhineland-Palatinate, as was already mentioned, and I also represent
the Snakemake Teaching Alliance, an alliance of some fellow developers.
All right, I want to introduce you to the reason why we ought to use workflow managers,
and not necessarily Snakemake alone; there are, of course, various alternatives.
And then I'll introduce Snakemake and its capabilities: how we handle software, and what's
in there for administrators and users. I'll show you how this can be launched when we get to work,
and a little bit about the details with regard to workflow parameterization, and how we can get
from a generic workflow, which runs on a desktop or a laptop, onto an HPC system.
So the first part: I think, when sites teach HPC 101, they teach the
batch system to their users. And I want to put a question mark behind that. Also, I want to
illustrate the benefits of a workflow system for administrators and the difference between a
workflow system and a pipeline, and thereby introduce a little bit of Snakemake.
Now, data analysis can be quite easy and can be conducted on a simple computer. However,
in the light of rather big amounts of data, this can be quite frustrating.
And just learning Slurm might not be the wisest move, and I'll illustrate here why.
But first, you have to understand that when we talk about data analysis, there are constituents
which, I think (and if you beg to differ, just interrupt me), are always or almost always present.
That is quality control; processing, presumably with multiple processing steps; then you do
some summary statistics; and then you want to publish, therefore you need to visualize,
to plot something statically or interactively. Not all such steps, if we are honest, are HPC-worthy.
Because, for instance, if you do an internet download, or move around and tinker with data
to feed it to your program, that's something you can do on any computer. That's not what you do
on an HPC system, not necessarily. But I want to argue that we can actually do that without
great harm. For that, I want to introduce you to a DAG. I know that many of you are computer
scientists, so you know this: a DAG is a directed acyclic graph, and that's an entity with which we
can display any data analysis. When I talk, you will always see these little boxes.
And you do not need to understand it in detail; here is a simple workflow. So first, for instance, in
linguistics, we can count words, make a plot of this, put this to a statistical test,
and archive the results. These are not necessarily HPC jobs, but jobs: something which
is carried out by a workflow manager, or by the scientist doing this kind of analysis. Then
a full DAG also takes into account that you presumably have multiple samples. Here in this
little linguistics example, where we count words in books, we might have several books, and then we
count words several times, and each count is subject to plotting, one statistical test, and one archiving step.
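To make this concrete, here is a minimal sketch of what such a word-count workflow might look like as a Snakemake Snakefile. The book list, file layout, and helper scripts are illustrative assumptions, not the actual demo from the slides:

```
# Minimal sketch of the word-count DAG.
# Book names, paths, and scripts are hypothetical.
BOOKS = ["dracula", "moby_dick", "sherlock"]

rule all:
    input:
        expand("plots/{book}.png", book=BOOKS),
        "results/test_summary.txt",
        "archive/results.tar.gz"

rule count_words:
    input: "books/{book}.txt"
    output: "counts/{book}.tsv"
    shell: "python scripts/count_words.py {input} > {output}"

rule plot:
    input: "counts/{book}.tsv"
    output: "plots/{book}.png"
    shell: "python scripts/plot_counts.py {input} {output}"

rule test:
    input: expand("counts/{book}.tsv", book=BOOKS)
    output: "results/test_summary.txt"
    shell: "python scripts/compare_counts.py {input} > {output}"

rule archive:
    input:
        expand("plots/{book}.png", book=BOOKS),
        "results/test_summary.txt"
    output: "archive/results.tar.gz"
    shell: "tar -czf {output} {input}"
```

Snakemake derives the DAG from these input and output declarations: each rule applied to one book becomes one job, and the test and archive jobs depend on all of them.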
And it can get extremely complex, I'll show you. If there are any questions, don't hesitate
to interrupt. So this is what you need to have as a background: basically, every such box
(whether you can read it or not doesn't matter) stands for a job. So I want to give you
an example. This here is a workflow. What it does is simply some proteotranscomics. So here we do
some mass spec data analysis and we subject this to some genome assembly and annotation. That's
something which we can conduct on an HPC system and it's quite HPC worthy because it's extremely
compute intensive. Hence, what does it take to run this on an HPC cluster? And I can be proud of
myself because I've been teaching all these intro courses for quite a while and one of those students
got to work and implemented this very workflow with bash and
slurm commands alone. I'll show you how it looks like. So these are all the files. Here we go.
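What the bash-and-Slurm version has to do by hand (requesting resources, submitting jobs, chaining dependencies), Snakemake can declare per rule. Here is a hedged sketch; the rule name, assembler command, and resource values are invented for illustration and are not the student's actual files:

```
# Hedged sketch: 'run_assembler' and all resource values are
# illustrative assumptions, not the workflow from the slide.
rule assemble_genome:
    input: "reads/{sample}.fastq.gz"
    output: "assembly/{sample}.fasta"
    threads: 32
    resources:
        mem_mb=64000,   # memory request, in MB
        runtime=720     # walltime, in minutes
    shell: "run_assembler --threads {threads} -o {output} {input}"

# On the cluster, the whole workflow is then launched with the
# Slurm executor plugin, e.g.:
#   snakemake --executor slurm --jobs 100
```

With this, Snakemake translates each job's declared resources into a Slurm submission, so the analyst never writes sbatch scripts or dependency chains by hand.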
Duration: 00:53:07 min
Recording date: 2025-01-17
Language: en-US
Speaker: Dr. Christian Meesters, Johannes Gutenberg University Mainz
Date: January 14, 2025
Abstract:
This talk highlights the benefits of using workflow management systems, with a focus on Snakemake, for multistep data analysis on high-performance computing (HPC) clusters. It shows how workflows can streamline research by automating tasks, managing software environments (e.g., Conda, containers, module files), and handling HPC-specific requirements like resource allocation and job submission. We introduce the Snakemake workflow catalog, a resource for prebuilt workflows to save time and avoid reinventing the wheel. Parameterization enables workflow flexibility and scalability. Finally, the talk will explore how Snakemake facilitates reproducibility, from deployment to comprehensive workflow reports with execution statistics and publication-ready outputs.
Material from past events is available at: https://hpc.fau.de/teaching/hpc-cafe/